Abstract

This introduction to Bayesian statistics presents the main concepts as well as the principal reasons advocated in favour of Bayesian modelling. We cover the various approaches to prior determination as well as the basic asymptotic arguments in favour of using Bayes estimators. The testing aspects of Bayesian inference are also examined in detail.
Keywords: Bayesian inference, Bayes model choice, 
foundations, testing, non-informative prior, Bayesian 
nonparametrics, Bayes factor 



1 Introduction: the Bayesian paradigm

In this chapter we give an overview of Bayesian data analysis, emphasising that it is a method for summarising uncertainty and making estimates and predictions using probability statements conditional on observed data and an assumed model (Gelman 2008), which makes it valuable and useful in Statistics, Econometrics, and Biostatistics, among other fields.

We first describe the basic elements of Bayesian analysis. In the following, we refrain from embarking upon philosophical discussions about the nature of knowledge (Robert 2001, Chapter 10) and the possibility of induction (Popper and Miller 1983), opting instead for a mathematically sound presentation of a statistical methodology. We indeed believe that the most convincing arguments for adopting a Bayesian version of data analyses are in the versatility of this tool and in the large range of existing applications.

1.1 First principles 

Recall that, given a set of observations $x \in \mathcal{X}$, a statistical model is defined as a family of probability distributions on $\mathcal{X}$, say $(P_\theta, \theta \in \Theta)$, and the aim of statistical inference is to derive quantitative information about the unknown parameter $\theta$. This information can be about explanatory features of the model, like the impact of an increase by one point of interest rates on the inflation rate, or the relevance of culling strategies during the latest foot-and-mouth epidemics in the UK, or yet the amount of cold dark matter in the Universe, or about predictive features, like the value of a particular stock the next day or the chances for a given individual of catching the H5N1 flu over the coming three months. Inference is quantitative in that it provides numerical values for the quantities of interest and numerical evaluations of the uncertainty surrounding those values as well.

Since all models are approximations of the real World, the choice of a sampling model is wide-open for criticisms: Bayesians promote the idea that a multiplicity of parameters can be handled via hierarchical, typically exchangeable, models (Gelman 2008). This is however a type of criticism that goes far beyond Bayesian modelling and questions the relevance of completely built models for drawing inference or running predictions.

The central idea behind Bayesian modelling is that the uncertainty on the unknown parameter $\theta$ is better modelled as randomness and consequently a probability distribution $\Pi$ is constructed on $\Theta$. In particular $P_\theta$ then represents the probability distribution of the observation $x$ given that the parameter is equal to $\theta$, i.e. the conditional probability distribution of $x$ given $\theta$. If $\Pi$ is a probability on $\Theta$, with density $\pi$ with respect to some measure $\nu$ on $\Theta$, then we can define a joint distribution for the observation and the parameter $(x, \theta)$

$$P^\pi\big((x,\theta) \in A \times B\big) = \int_{\theta \in B} P_\theta(A)\,\pi(\theta)\,\mathrm{d}\nu(\theta).$$

For the sake of simplicity we consider only models $(P_\theta, \theta \in \Theta)$ that allow for a dominating measure, $\mu$ (say the Lebesgue measure), and we denote by $f(\cdot|\theta)$ the density of $P_\theta$ with respect to $\mu$ (the likelihood). Then the joint distribution of $(x,\theta)$ has density

$$p_\pi(x,\theta) = f(x|\theta)\,\pi(\theta), \tag{1}$$

with respect to $\mu \times \nu$. Using Bayes theorem we can define the distribution of the parameter $\theta$ given the observations by its density with respect to $\nu$,

$$\pi(\theta|x) = \frac{f(x|\theta)\,\pi(\theta)}{\int_\Theta f(x|\theta)\,\pi(\theta)\,\mathrm{d}\nu(\theta)}, \tag{2}$$

and denote the denominator by

$$m(x) = \int_\Theta f(x|\theta)\,\pi(\theta)\,\mathrm{d}\nu(\theta).$$

The probability $\Pi$ ($\pi$, respectively) on $\Theta$ is called the prior distribution (density, respectively), the conditional probability (2) of $\theta$ given $x$ is called the posterior distribution (density, respectively) and $m(x)$ is the marginal density of the observation $x$. Then, Bayesian analysis is based entirely on the posterior distribution for all inferential purposes, e.g. to draw conclusions on the parameter $\theta$ or on some functions of the parameter $\theta$, to make predictions, to test the plausibility of a hypothesis or to check the fit of the model.



There are many arguments which make such an approach compelling. Without entering into philosophical and epistemological arguments on the nature of Science (Jeffreys 1939; MacKay 2002; Jaynes 2003), we briefly state what we view as the main practical appealing features of introducing a prior probability on $\theta$. First, such an approach allows one to incorporate prior information in a natural way in the model, as explained in Section 2. Second, by defining a probability measure on the parameter space $\Theta$, the Bayesian approach gives a proper meaning to notions such as the probability that $\theta$ belongs to a specific region, which are particularly relevant when constructing measures of uncertainty like confidence regions or when testing hypotheses. Furthermore, the posterior distribution (2) can be interpreted as the actualisation of the knowledge (uncertainty) on the parameter after observing the data. We stress that the Bayesian paradigm does not state that the model within which it operates is the "truth", no more than it believes that the corresponding prior distribution it requires has a connection with the "true" production of parameters (since there may even be no parameter at all). It simply provides an inferential machine that has strong optimality properties under the right model and that can similarly be evaluated under any other well-defined alternative model. Furthermore, the Bayesian approach is such that techniques allow prior beliefs to be tested and discarded as appropriate (Gelman 2008), in agreement with the overall principle that a Bayesian data analysis has three stages: formulating a model, fitting the model to data, and checking the model fit (Gelman 2008), so there seems to be little reason for not using a given model at an earlier stage even when dismissing it as "un-true" later (always in favour of another model).

In the above formulation, note that $\Theta$ can be endowed with quite different features: it can be a finite dimensional set (as in parametric models), an infinite dimensional set (as in most semi/non parametric models) or a collection of various sets with no fixed dimension (as in model choice).

As an example, consider the following contingency table on survival rate for breast-cancer patients with or without malignant tumours, extracted from Bishop et al. (1975), the ultimate goal being to distinguish between both types of tumour in terms of survival probability:

                           survival
age         malignant     yes    no
under 50    no             77    10
            yes            51    13
50-69       no             51    11
            yes            38    20
above 70    no              7     3
            yes             6     3

Figure 1: Representation of two gamma posterior distributions differentiating between malignant (dashes) versus non-malignant (full) breast cancer survival rates.

Then if we assume that both groups (malignant versus non-malignant) of survivors are Poisson distributed $\mathcal{P}(N_t\theta)$, where $N_t$ is the total number of patients in age group $t$, i.e.

$$f(x_t|\theta,N_t) = \frac{(\theta N_t)^{x_t}}{x_t!}\,e^{-\theta N_t}, \qquad x_t \in \mathbb{N},$$

then we obtain a likelihood

$$L(\theta|D) \propto \prod_{t=1}^{3} (\theta N_t)^{x_t} \exp\{-\theta N_t\}$$

which, under an exponential $\theta \sim \mathcal{E}xp(2)$ prior (whose rate 2 is chosen here for illustration purposes), leads to the posterior

$$\pi(\theta|D) \propto \theta^{x_1+x_2+x_3} \exp\{-\theta(2+N_1+N_2+N_3)\},$$

i.e. a Gamma $\Gamma(x_1+x_2+x_3+1,\, 2+N_1+N_2+N_3)$ distribution. The choice of the exponential parameter corresponds to a 50% survival probability. In the case of the non-malignant breast cancers, the parameters of the Gamma distribution are $a = 136$ and $b = 161$, while, for the malignant cancers, they are $a = 96$ and $b = 133$. Figure 1 shows the difference between those posteriors.
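
The Gamma parameters quoted above can be recovered directly from the contingency table. The following is a minimal sketch, assuming the Poisson likelihood and the exponential prior with rate 2 described in the text; the dictionary layout and the function name `gamma_posterior` are illustrative choices, not part of the original analysis.

```python
# Survivors and deaths per age group (under 50, 50-69, above 70), read off the table.
counts = {
    "non-malignant": {"survived": [77, 51, 7], "died": [10, 11, 3]},
    "malignant": {"survived": [51, 38, 6], "died": [13, 20, 3]},
}

PRIOR_RATE = 2.0  # rate of the Exp(2) prior, i.e. a Gamma(1, 2) prior on theta


def gamma_posterior(survived, died, prior_rate=PRIOR_RATE):
    """Return (shape, rate) of the Gamma posterior on theta."""
    totals = [s + d for s, d in zip(survived, died)]  # N_t for each age group
    shape = sum(survived) + 1          # x_1 + x_2 + x_3 + 1
    rate = prior_rate + sum(totals)    # 2 + N_1 + N_2 + N_3
    return shape, rate


posteriors = {
    group: gamma_posterior(c["survived"], c["died"]) for group, c in counts.items()
}
for group, (a, b) in posteriors.items():
    print(f"{group}: Gamma({a}, {b:g}), posterior mean {a / b:.3f}")
```

The shape and rate returned for each group match the values $a$ and $b$ given in the text, and the ratio $a/b$ is the posterior mean of $\theta$ for that group.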

1.2 Extension to improper priors 

In many situations, it is useful to extend the above setup to prior measures that are not probability distributions but $\sigma$-finite measures with infinite mass, i.e.

$$\int_\Theta \pi(\theta)\,\mathrm{d}\nu(\theta) = +\infty,$$

since, provided that

$$\int_\Theta f(x|\theta)\,\pi(\theta)\,\mathrm{d}\nu(\theta) < +\infty, \tag{3}$$

almost everywhere (in $x$), the quantity (2) is still well-defined as a probability density, as when using a regular probability distribution as prior (Hartigan 1983; Berger 1985; Robert 2001). Such extensions are justified for a variety of reasons, ranging from topological coherence (limits of Bayesian procedures often partake of their optimality properties, see Wald 1950, and should therefore be included in the range of possible procedures) to robustness (a measure with an infinite mass is much more robust than a true probability distribution with a large variance). Improper priors are typically encountered in situations where there is little or no prior information, inducing flat, i.e. uniform, distributions on the parameter space (or on some transforms of the parameter space). Indeed it is quite common, for complex models, to have little or no information on some of the parameters present in the model and using improper priors for such parameters has many advantages. Note however that in such cases the marginal density $m(x)$ does not define a probability on $\mathcal{X}$ (and that the existence condition (3) needs to be checked). This drawback has an important consequence for Bayesian model comparison, as explained in Section 5.

Note also that some improper priors never allow for well-defined posteriors, no matter how many observations there are in the sample. One such example is when the prior is $\pi(\theta) = \exp(+\theta^2)$ and the observations are iid Cauchy. Another and less anecdotal example occurs in mixture models, under exchangeable improper priors on the components (Lee et al. 2008).

1.3 Bayesian decision theory

As a general modus vivendi, let us first stress that inference as a whole is meaningless unless it is evaluated. The evaluation of a statistical procedure, i.e. determining how well or how badly the inference performs, requires the definition of a comparison criterion, called a loss function. Let $\mathcal{D}$ denote the set of all possible results of the inference (corresponding to the decision set in game theory). An estimator is then a function from $\mathcal{X}$ into $\mathcal{D}$. (With an obvious abuse of notation, we will also use $\mathcal{D}$ for the set of estimators.) For instance, if the aim is to estimate $\theta$, then $\mathcal{D} = \Theta$; if the aim is to test for some hypothesis, then $\mathcal{D} = \{0, 1\}$; and $\mathcal{D} = \mathcal{X}$, the set of values of a future observation, if the aim is to predict a future observation. A loss function $L$ is a function on $\Theta \times \mathcal{D}$, expressing what the loss (cost) is for considering a decision $\delta$ when $\theta$ is the true value. Typical (formal) loss functions used for estimation and test are quadratic losses ($L(\theta,\delta) = \|\theta - \delta\|^2$) and 0-1 losses (1 if the decision is wrong, 0 if it is right), respectively. Other loss functions can (should) be constructed, depending on the problem at hand, and they are strongly related to the notion of utility function encountered in economics and game theory (Berger 1985).

Given a statistical model $(P_\theta, \theta \in \Theta)$ on $x \in \mathcal{X}$, a prior $\pi$ on $\theta \in \Theta$ and a loss function $L$, the (optimal) Bayesian procedure (estimator) is then defined as the decision function $\delta$ minimising the integrated risk $r(\pi, \delta)$:

$$\delta^\pi = \arg\min_{\delta \in \mathcal{D}} r(\pi, \delta)$$

where

$$r(\pi, \delta) = \int_\Theta \int_{\mathcal{X}} L(\theta, \delta(x))\,f(x|\theta)\,\pi(\theta)\,\mathrm{d}\mu(x)\,\mathrm{d}\nu(\theta).$$

Such a procedure is called a Bayes estimator. Using the fact (Robert 2001) that such estimators can be computed pointwise as minimising the posterior risk: $\forall x \in \mathcal{X}$,

$$\delta^\pi(x) = \arg\min_{\delta \in \mathcal{D}} \int_\Theta L(\theta, \delta(x))\,\pi(\theta|x)\,\mathrm{d}\nu(\theta),$$

it is possible to derive explicit expressions of Bayes estimates for many common loss functions. In particular, the Bayes estimator associated with the quadratic loss and the posterior distribution $\pi(\cdot|x)$ is the posterior mean

$$\delta^\pi(x) = \int_\Theta \theta\,\pi(\theta|x)\,\mathrm{d}\theta.$$

Note that the integrated risk $r(\pi, \delta)$ can also be expressed as $\int_\Theta R(\theta, \delta)\,\pi(\theta)\,\mathrm{d}\nu(\theta)$, where $R(\theta,\delta) = \int_{\mathcal{X}} L(\theta,\delta(x))\,f(x|\theta)\,\mathrm{d}\mu(x)$ is the frequentist risk, so that Bayes estimates are also often optimal in the frequentist sense. (It can be shown in particular that any admissible estimator is the limit of Bayes estimators, see Berger 1985 or Robert 2001.)
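
The claim that the posterior mean minimises the posterior quadratic risk can be checked numerically. The sketch below is only an illustration under stated assumptions: the posterior is taken to be a Gamma distribution with shape 136 and rate 161 (borrowed from the survival example), the posterior risk is approximated by Monte Carlo, and the grid of candidate decisions is an arbitrary choice.

```python
import random

random.seed(0)

# Assumed posterior for the illustration: Gamma(shape=136, rate=161).
shape, rate = 136, 161.0
# random.gammavariate takes (shape, scale), hence scale = 1 / rate.
draws = [random.gammavariate(shape, 1.0 / rate) for _ in range(5000)]


def posterior_risk(d, draws):
    """Monte Carlo estimate of the posterior quadratic risk E[(theta - d)^2 | x]."""
    return sum((t - d) ** 2 for t in draws) / len(draws)


step = 0.001
grid = [0.7 + step * k for k in range(301)]  # candidate decisions in [0.7, 1.0]
best = min(grid, key=lambda d: posterior_risk(d, draws))
post_mean = sum(draws) / len(draws)

print(f"grid minimiser: {best:.3f}, Monte Carlo posterior mean: {post_mean:.3f}")
```

Up to the grid resolution, the decision minimising the estimated posterior risk coincides with the average of the posterior draws, which is what the pointwise characterisation of the Bayes estimator predicts for the quadratic loss.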



2 On the selection of the prior 

A critical aspect is the determination of the prior distribution $\pi$ and its clear influence on the inference. It is straightforward to come up with examples where a particular choice of the prior leads to absurd decisions. Hence, for a Bayesian analysis to be sound the prior distribution needs to be well-justified. Before entering into a brief description of the various ways of constructing prior distributions, note that as part of model checking, it is necessary in every Bayesian analysis to assess the influence of the choice of the prior, for instance through a sensitivity analysis. Since the prior distribution models the knowledge (or uncertainty) prior to the observation of the data, the sparser the prior information is, the flatter the prior should be. The advantage of incorporating prior information via a prior distribution is rather universally accepted and we therefore first describe ways of eliciting prior distributions from prior knowledge.

2.1 Elicited priors 

The elicitation of prior distributions from prior knowledge consists in the construction of the prior probability $\pi(\theta)$ using all items of prior information available to the modeller. This prior information may come from expert opinions or from bibliographic data or yet from earlier analyses, as in meta-analysis. There exists a vast literature on prior elicitation based on expert opinions, which is a much more complex process than is usually acknowledged in most Bayesian statistical textbooks; see Section 2 of this book for a more complete discussion on prior elicitation based on expert opinions.

In particular the prior information is rarely rich enough to entirely define a prior distribution, therefore it is customary to choose a prior distribution within a parametric class of possible distributions: $\pi(\theta|\gamma)$, where $\gamma \in \Gamma$ is called a hyperparameter. In such cases the prior information is summarised through the choice of $\gamma$. For instance, Albert et al. (2008) use bibliographic prior information to construct a prior distribution on the probability of cross-contamination from a contaminated broiler in a household, say $p$. The prior distribution of $p$ is assumed to be a Beta $\mathcal{B}e(a, b)$ distribution,

$$\pi(p|a, b) \propto p^{a-1}(1 - p)^{b-1}, \qquad 0 < p < 1,$$

and the parameters $(a, b)$ of the Beta distribution are assessed using two cross-contamination models in the literature which lead to a probability of transfer between 1/3 and 2/3, which was translated into a Beta(8, 8) prior on $p$, as it corresponds to a prior mean of 0.5 and to a 95% prior confidence interval equal to (0.27, 0.73). See also Dupuis (1995) for an example of expert elicitation of the Beta parameters on some capture and survival probabilities in a lizard population, or the Chapter of Bocker, Crimmi and Fink in this volume where beta priors are elicited to model correlations between risk types.
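
The summaries quoted for the Beta(8, 8) prior can be checked with a few lines of code. This is a rough sketch relying only on the standard library: the cumulative distribution function is approximated by a midpoint rule and inverted by bisection, and the helper names (`beta_pdf`, `beta_cdf`, `beta_quantile`) are introduced here for illustration only.

```python
import math


def beta_pdf(p, a, b):
    """Density of the Beta(a, b) distribution at p in (0, 1)."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(p) + (b - 1) * math.log(1 - p))


def beta_cdf(q, a, b, steps=2000):
    """Midpoint-rule approximation of P(p <= q) under Beta(a, b)."""
    h = q / steps
    return h * sum(beta_pdf((k + 0.5) * h, a, b) for k in range(steps))


def beta_quantile(level, a, b, tol=1e-6):
    """Invert the approximate CDF by bisection on (0, 1)."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if beta_cdf(mid, a, b) < level:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


a, b = 8, 8
prior_mean = a / (a + b)
lower = beta_quantile(0.025, a, b)
upper = beta_quantile(0.975, a, b)
print(f"prior mean {prior_mean}, central 95% interval ({lower:.2f}, {upper:.2f})")
```

The equal-tailed interval computed this way should agree, to two decimals, with the interval reported in the text; in practice a statistical library would provide these quantiles directly.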

2.2 Conjugate priors 

Among the possible parametric families $\pi(\theta|\gamma)$, $\gamma \in \Gamma$, conjugate priors form appealing parametric families, merely for computational reasons (Berger 1985; Robert 2001). A family of prior distributions $\pi(\theta|\gamma)$ is said to be conjugate to the likelihood $f(x|\theta)$ if the posterior also belongs to the same family, i.e. when the prior is equal to $\pi(\theta|\gamma_0)$ then there exists a $\gamma(x, \gamma_0) \in \Gamma$ such that the posterior is equal to $\pi(\theta|\gamma(x, \gamma_0))$. The actualisation of the information due to observing the data $x$ is then modelled as a change of hyperparameter from $\gamma_0$ to $\gamma(x, \gamma_0)$. Exponential families (as models for the observation $x$) are almost in one-to-one correspondence with sampling distributions allowing for conjugate priors. As an example, Carlin and Louis (2001) consider an observed random variable $X$ that is the number of pregnant women arriving at a given hospital to deliver their babies within a given month, which they model as a Poisson $\mathcal{P}(\theta)$ distribution with parameter $\theta > 0$. A conjugate family of priors for the Poisson model is the collection of gamma distributions $\Gamma(a, b)$, since

$$f(x|\theta)\,\pi(\theta|a,b) \propto \theta^{a-1+x}\,e^{-(b+1)\theta}$$

leads to the posterior distribution of $\theta$ given $X = x$ being the gamma distribution $\Gamma(a + x, b + 1)$. The computation of estimators, of confidence regions (called credible regions within the Bayesian literature to distinguish the fact that those regions are evaluated on the parameter space rather than on the observation space, see Berger 1985) or of other types of summaries of interest on the posterior distribution then becomes straightforward. For instance in the above Poisson-Gamma example, the Bayesian estimator of the average number of arrivals, associated with the quadratic loss, is given by $\delta^\pi(x) = (a+x)/(b+1)$, the posterior mean.
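
The Poisson-Gamma update can be written as a one-line change of hyperparameters. The sketch below assumes the shape-rate parameterisation of the gamma distribution used in the text; the hyperparameter values and the observed count are hypothetical, chosen only to make the arithmetic easy to follow, and the ratio check at the end is a numerical confirmation of conjugacy rather than part of the original argument.

```python
import math


def poisson_gamma_update(a, b, x):
    """Gamma(a, b) prior (shape a, rate b) and one Poisson count x give Gamma(a + x, b + 1)."""
    return a + x, b + 1


def gamma_pdf(theta, a, b):
    """Gamma density with shape a and rate b."""
    return math.exp(a * math.log(b) - math.lgamma(a) + (a - 1) * math.log(theta) - b * theta)


def poisson_pmf(x, theta):
    """Poisson probability of the count x for mean theta."""
    return math.exp(x * math.log(theta) - theta - math.lgamma(x + 1))


a0, b0 = 2.0, 0.5   # hypothetical prior hyperparameters
x = 7               # hypothetical observed number of arrivals
a1, b1 = poisson_gamma_update(a0, b0, x)
estimate = a1 / b1  # posterior mean (a + x) / (b + 1)

# Conjugacy check: likelihood x prior divided by the claimed posterior density
# must not depend on theta (the constant ratio is the marginal m(x)).
ratios = [poisson_pmf(x, t) * gamma_pdf(t, a0, b0) / gamma_pdf(t, a1, b1) for t in (0.5, 3.0, 10.0)]
print(a1, b1, estimate, ratios)
```

With these hypothetical values the posterior is a gamma distribution with shape 9 and rate 1.5, so the Bayes estimate under quadratic loss is 6.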
The apparent simplicity of conjugate priors should however not make them excessively appealing, since there is no strong justification for their use. One of the difficulties with such families of priors is the influence of the hyperparameter $\gamma$. If the prior information is not rich enough to justify a specific value of $\gamma$, fixing $\gamma = \gamma_0$ arbitrarily is problematic, since it does not take into account the prior uncertainty on $\gamma_0$ itself. To improve on this aspect of conjugate priors, a usual fix is to consider a hierarchical prior, i.e. to assume that $\gamma$ itself is random and to consider a probability distribution with density $q$ on $\gamma$, leading to

$$\theta|\gamma \sim \pi(\theta|\gamma), \qquad \gamma \sim q(\gamma),$$

as a joint prior on $(\theta, \gamma)$. The above is equivalent to considering, as a prior on $\theta$,

$$\pi(\theta) = \int_\Gamma \pi(\theta|\gamma)\,q(\gamma)\,\mathrm{d}\gamma.$$

Obviously $q$ may also depend on some hyperparameters $\eta$. Higher order levels in the hierarchy are thus possible, even though the influence of the hyper(-hyper-)parameter $\eta$ on the posterior distribution of $\theta$ is usually smaller than that of $\gamma$. But multiple levels are nonetheless useful in complex populations such as those found in animal breeding (Sørensen and Gianola 2002).



In many applications prior information is quite vague or at least vague enough on some parts of the model, in which case it is important to derive priors that have desirable properties and that are as little arbitrary or subjective as possible. Such constructions are commonly called non informative. While this denomination is misleading, and should be replaced by the less judgemental reference prior denomination, we nonetheless follow suit and use it in the following subsections, since it is the most common denomination found in the literature (Kass and Wasserman 1996).



2.3 Non informative priors 

Non informative priors are expected to be flat distributions, possibly improper. An apparently natural way of constructing such priors would be to consider a uniform prior; however this solution has many drawbacks, the worst one being that it is not invariant under a change of parameterisation. To understand this consider the example of a Binomial model: the observation $x$ is a $\mathcal{B}(n,p)$ random variable, with $p \in (0, 1)$ unknown. The uniform prior $\pi(p) = 1$ could then sound like the most natural non informative choice; however, if, instead of the mean parameterisation by $p$, one considers the logistic parameterisation $\theta = \log(p/(1 - p))$ then the uniform prior on $p$ is transformed into the logistic density

$$\pi(\theta) = e^\theta/(1 + e^\theta)^2$$

by the Jacobian transform, which obviously is not uniform.
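
The Jacobian argument can be verified numerically: the density induced on $\theta$ by a uniform prior on $p$ is $|\mathrm{d}p/\mathrm{d}\theta|$. The sketch below compares the logistic density with a central finite-difference derivative of the inverse logit map; the evaluation points and the step size are arbitrary choices made for this illustration.

```python
import math


def inv_logit(theta):
    """p as a function of theta = log(p / (1 - p))."""
    return 1.0 / (1.0 + math.exp(-theta))


def induced_density(theta):
    """Density on theta induced by a uniform prior on p, as given in the text."""
    return math.exp(theta) / (1.0 + math.exp(theta)) ** 2


def numerical_jacobian(theta, h=1e-5):
    """Central finite-difference approximation of dp/dtheta."""
    return (inv_logit(theta + h) - inv_logit(theta - h)) / (2 * h)


points = (-3.0, 0.0, 3.0)
for theta in points:
    print(theta, induced_density(theta), numerical_jacobian(theta))
```

The two columns agree, and the density is visibly far from constant: it peaks at $\theta = 0$ and decays in both tails, so flatness in $p$ does not carry over to $\theta$.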

To circumvent this lack of invariance per reparameterisation, Jeffreys (1939) proposed the following choice, now known as Jeffreys' prior

$$\pi(\theta) \propto \sqrt{|i(\theta)|}, \tag{4}$$

where $i(\theta)$ is the Fisher information matrix and $|i(\theta)|$ denotes its determinant. The above construction is obviously invariant per reparameterisation and has many other interesting features, especially in one-dimensional setups (see Robert et al. 2009 for a reassessment of Jeffreys' impact on Bayesian statistics). In particular, in the one-dimensional parameter case, the Jeffreys prior is also the matching prior (see Chapters 3 and [...] of Robert 2001) and the reference prior defined by Bernardo (Bernardo 1979; Clarke and Barron 1990). For instance, when $P_\theta$ is a location family, i.e. when $f(x|\theta) = g(x - \theta)$, the Fisher information is constant and thus the Jeffreys prior is $\pi(\theta) = 1$. Note that in many cases like the above the Jeffreys prior is improper.

In multivariate setups, Jeffreys' construction is not so well-justified and it may lead to not-so-well-behaved priors. A famous example is the Neyman-Scott problem, where the observations come in groups of two such that in each group all observations are distributed from the same normal distribution,

$$x_{ij} \sim \mathcal{N}(\mu_i, \sigma^2), \qquad i = 1, \ldots, n, \quad j = 1, 2.$$

In this case Jeffreys' prior is given by $\pi(\mu_1, \ldots, \mu_n, \sigma) \propto \sigma^{-(n+1)}$, and the Bayes estimator of $\sigma^2$ associated with the quadratic loss function is equal to

$$\mathbb{E}^\pi[\sigma^2 | x_{1,1}, \ldots, x_{n,2}] = \frac{\sum_{i=1}^n (x_{i,1} - x_{i,2})^2}{4(n-1)},$$

which converges to $\sigma^2/2$ as $n$ goes to infinity, thus leading to an inconsistent estimate. Although this seems like an artificial example it is actually of wider interest, since in the normal linear regression model Jeffreys' prior is proportional to $\sigma^{-p-2}$ where $p$ is the number of covariates. This dependence on $p$ makes it rather unappealing, even though the alternative $g$-prior of Zellner (1986) discussed below suffers from the same drawback. Another standard example, discussed further in Section 4, is when estimating $\|\theta\|^2$ when $\theta$ is the $n$-dimensional mean of an $n$-dimensional normal vector.
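
The inconsistency in the Neyman-Scott problem is easy to reproduce by simulation. The sketch below assumes that the posterior expectation of $\sigma^2$ under Jeffreys' prior takes the form $\sum_i (x_{i,1} - x_{i,2})^2 / (4(n-1))$, which is the form consistent with the $4(n-1)$ denominator and the $\sigma^2/2$ limit stated above; the pair-specific means are drawn arbitrarily, since they cancel in the differences, and the function name is illustrative.

```python
import random


def neyman_scott_estimate(n, sigma, seed=0):
    """Simulate n pairs x_{i1}, x_{i2} ~ N(mu_i, sigma^2) and return
    sum_i (x_{i1} - x_{i2})^2 / (4 (n - 1))."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        mu_i = rng.uniform(-10.0, 10.0)  # arbitrary pair-specific mean
        x1 = rng.gauss(mu_i, sigma)
        x2 = rng.gauss(mu_i, sigma)
        total += (x1 - x2) ** 2
    return total / (4 * (n - 1))


sigma = 2.0
estimate = neyman_scott_estimate(20000, sigma)
print(f"estimate: {estimate:.3f}, sigma^2: {sigma ** 2}, sigma^2 / 2: {sigma ** 2 / 2}")
```

Even with a very large number of pairs the estimate settles near $\sigma^2/2$ rather than $\sigma^2$, because the number of nuisance means grows at the same rate as the sample size.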

The ultimate attempt to define a non informative prior is in our opinion Bernardo's (1979) definition through the information theoretical device of Kullback divergence (see also Berger and Bernardo 1992; Berger et al. 2009). The idea is to split the parameter into groups, say $(\theta_{(1)}, \ldots, \theta_{(p)})$, where $\theta_{(1)}$ is more interesting than $\theta_{(2)}$, which is more interesting than $\theta_{(3)}$, and so on. This can be seen as a generalisation of the usual splitting into a parameter of interest and a nuisance parameter. Then Bernardo's reference prior is constructed iteratively as some sort of Jeffreys' priors in each of the submodels; see also Robert (2001) for a more precise description of the iterative construction. Quite obviously, this is not the unique possible approach: it depends on a choice of information measure, does not always lead to a solution, and requires an ordering of the model parameters that involves some prior information (or some subjective choice). But, as long as we do not think of those reference priors as representing ignorance (Lindley 1971), they can indeed be taken as reference priors, upon which everyone could fall back when the prior information is missing (Kass and Wasserman 1996).



2.4 Some asymptotic results 

A well-known phenomenon is the decrease of influence of the prior as the sample size (or the information in the data) increases. We shall recall here these results in the simpler case of i.i.d. observations; however these results can be extended to non i.i.d. cases such as dependent observations under stationary and mixing properties, Gaussian processes and so on. Generally speaking, in most parametric cases, the posterior distribution concentrates towards the true parameter value as $n$ goes to infinity, so that posterior estimates converge to the true values. This first type of result ensures that point estimates are satisfactory, as far as asymptotic convergence is concerned.

Another important aspect of the asymptotic analysis of Bayesian procedures is to understand how the measures of uncertainty derived from the posterior can be related to frequentist measures of uncertainty. Such a relation can be deduced from the Bernstein-von Mises property, which can be stated in the following way: Assume that the vector of observations $x = (x_1, \ldots, x_n) := x^n$ is made of i.i.d. observations from a distribution $f(\cdot|\theta)$, which is regular (see for instance Ghosh and Ramamoorthi (2003) for more precise conditions), and let $\pi$ be a prior density, which is positive and continuous on $\Theta$. Then the posterior distribution can be approximated in the following way, when $n$ goes to infinity: for all $A \subset \Theta$,

$$P^\pi\big[\sqrt{n}(\theta - \hat\theta) \in A \,\big|\, x^n\big] \approx P\big[\mathcal{N}(0, i_1(\hat\theta)^{-1}) \in A\big],$$

where $\hat\theta$ is the maximum likelihood estimator and $i_1(\hat\theta)$ is the Fisher information matrix per observation calculated at $\theta = \hat\theta$. In other words the posterior distribution resembles a Gaussian distribution centred at $\hat\theta$ with covariance matrix $i_1^{-1}(\hat\theta)/n$, when $n$ is large.
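
A conjugate model makes this approximation easy to inspect, because the posterior is available in closed form. The sketch below assumes i.i.d. Poisson observations with a Gamma(a, b) prior (shape and rate), for which the posterior is Gamma(a + sum of counts, b + n), the maximum likelihood estimator is the sample mean and the Fisher information per observation is $1/\theta$. The prior hyperparameters and the data summary (a sample mean fixed at 3) are hypothetical.

```python
import math


def gamma_posterior_summary(a, b, total, n):
    """Mean and standard deviation of the Gamma(a + total, b + n) posterior."""
    shape, rate = a + total, b + n
    return shape / rate, math.sqrt(shape) / rate


def normal_approx_summary(total, n):
    """Mean and standard deviation of N(mle, 1 / (n * i_1(mle))) with i_1(theta) = 1 / theta."""
    mle = total / n
    return mle, math.sqrt(mle / n)


a, b = 5.0, 0.5  # hypothetical prior with mean 10, far from the data
scaled_gaps = []
for n in (10, 100, 10000):
    total = 3 * n  # hypothetical data summary: sample mean equal to 3
    post_mean, post_sd = gamma_posterior_summary(a, b, total, n)
    approx_mean, approx_sd = normal_approx_summary(total, n)
    # Gap between the two centres, measured on the sqrt(n) scale of the theorem.
    scaled_gaps.append(math.sqrt(n) * abs(post_mean - approx_mean))
    print(n, round(post_mean, 4), round(approx_mean, 4), round(post_sd / approx_sd, 4))
```

As $n$ grows, the posterior mean approaches the maximum likelihood estimator even after rescaling by $\sqrt{n}$, and the ratio of standard deviations tends to one, so the prior no longer matters to first order.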

This result has many interesting implications. The first consequence is that, to first order, the influence of the prior disappears as $n$ goes to infinity. It also allows for quick approximate computations in the case of large samples, and it implies that to first order Bayesian and frequentist inference (based on the likelihood) essentially give the same answers. Although devising procedures giving the same answers as frequentist procedures is not an ultimate aim of the Bayesian analysis, it is of importance to ensure that Bayesian procedures ultimately also have good frequentist properties. The asymptotic equivalence between the Bayesian and the frequentist answers (to first order) holds in wide generality for finite dimensional models. When the dimension of the parameter grows with the number of observations or is infinite, then this is often not true anymore, see for instance Freedman (1999) and Rivoirard and Rousseau (2009).



Although these asymptotic results have a strong frequentist flavour, in the sense that they are obtained by assuming that there is a fixed true parameter $\theta_0$ and that, as new data comes in, the posterior concentrates around the true parameter like a Gaussian distribution, they are also appealing from the subjectivist point of view, where probabilities represent degrees of belief and there is no objective probability model; see Diaconis and Freedman (1986) for a more precise discussion on this issue.

3 Measures of uncertainty: credible regions

Recall that the whole inference about $\theta$ is deduced from the posterior distribution, $\pi(\theta|x)$, including estimates as major summaries, but the posterior distribution gives us much more information than simply point estimates. In particular, different measures of uncertainty can be derived from the posterior and among the various measures credible regions are the most popular. A set $C \subset \Theta$ is an $\alpha$-credible region if and only if

$$P^\pi[\theta \in C | x] \ge 1 - \alpha. \tag{5}$$

Contrary to frequentist confidence regions, the notion of coverage probability is directly understood as a probability on $\theta$ and is therefore straightforward to interpret. Among all credible regions defined by (5), those having minimal volume are particularly interesting. It turns out, see Robert (2001), that they are defined as highest posterior density (HPD) regions:

$$C^\pi_\alpha = \{\theta : \pi(\theta) f(x|\theta) \ge k_\alpha(x)\}$$

where $k_\alpha(x)$ is the largest value such that $P^\pi[\theta \in C^\pi_\alpha | x] \ge 1 - \alpha$.



(Note that we define the bound $k_\alpha(x)$ in terms of the product prior $\times$ likelihood in order to bypass the difficulty with the normalising constant $m(x)$.)

Although the analytic determination of $k_\alpha(x)$ is often challenging, the approximation of this bound based on a sample from $\pi(\theta|x)$, $\theta^{(1)}, \ldots, \theta^{(M)}$, can be easily derived from an ordering of the values $\pi(\theta^{(i)}) f(x|\theta^{(i)})$ as the corresponding $(1 - \alpha)$-th quantile. For instance, if a Poisson $X \sim \mathcal{P}(\theta)$ count is associated with a Gamma $\Gamma(a,b)$ prior, the posterior $\Gamma(a + x, b + 1)$ leads to the HPD region

$$\{\theta : \theta^{a+x-1} \exp(-(b + 1)\theta) \ge k_\alpha(x)\}$$

whose determination requires a numerical construct. On the other hand, if a sample $\theta^{(1)}, \ldots, \theta^{(M)}$ from the posterior $\Gamma(a + x, b + 1)$ is available, then the HPD bound $k_\alpha(x)$ can be estimated as the $(1 - \alpha)$-th quantile of the values $[\theta^{(i)}]^{a+x-1} \exp(-(b + 1)\theta^{(i)})$. Figure 2 illustrates a similar derivation in the case of a normal $\mathcal{N}(\theta, \sigma^2)$ model with both parameters unknown.
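
The sample-based estimate of the HPD bound can be sketched as follows for the Poisson-Gamma case. Several choices here are assumptions made for the illustration: the hyperparameters and the observed count are hypothetical, the density values are handled on the log scale (a monotone transform, so the ordering is unchanged), and the quantile is read off a decreasing ordering, so that a proportion of at least $1 - \alpha$ of the draws lies above the bound.

```python
import math
import random


def hpd_bound(draws, log_density, alpha):
    """Estimate the log of k_alpha: sort the log-density values of the draws in
    decreasing order and return the value at rank ceil((1 - alpha) * M)."""
    values = sorted((log_density(t) for t in draws), reverse=True)
    rank = math.ceil((1 - alpha) * len(values))
    return values[rank - 1]


random.seed(42)
a, b, x = 2.0, 1.0, 5        # hypothetical prior hyperparameters and observed count
shape, rate = a + x, b + 1   # posterior Gamma(a + x, b + 1)
# random.gammavariate takes (shape, scale), hence scale = 1 / rate.
draws = [random.gammavariate(shape, 1.0 / rate) for _ in range(10000)]


def log_post(theta):
    """Log of the unnormalised posterior theta^(a + x - 1) exp(-(b + 1) theta)."""
    return (shape - 1) * math.log(theta) - rate * theta


alpha = 0.05
log_k = hpd_bound(draws, log_post, alpha)
inside = [t for t in draws if log_post(t) >= log_k]
coverage = len(inside) / len(draws)
# The gamma density is unimodal, so the estimated region is an interval.
hpd_interval = (min(inside), max(inside))
print(f"coverage {coverage:.4f}, approximate HPD interval {hpd_interval}")
```

Because the unnormalised density is used throughout, the normalising constant $m(x)$ is never needed, which is the point of defining the bound in terms of the product prior times likelihood.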

Credible regions have nice interpretations and are optimal under a volume criterion, as Bayesian estimators of the confidence sets $C$. In wide generality, they further attain good frequentist coverage in the sense that $P_\theta(\theta \in C) = 1 - \alpha + O(n^{-1/2})$ for most prior distributions $\pi$, where $n$ denotes the sample size (Welch and Peers 1963; Robert 2001, Chapter 5). Credible regions however suffer from a lack of invariance to changes of parameterisation, i.e. if $\theta$ is a given parameterisation of interest and $C^\pi_\alpha$ is the HPD region constructed as above, then if $\eta = g(\theta)$ is another parameterisation, $g(C^\pi_\alpha) = \{\eta = g(\theta); \theta \in C^\pi_\alpha\}$ is not necessarily the HPD region for the $\eta$ parameterisation (see Druilhet and Marin 2007 for a detailed analysis of this phenomenon).



4 Nuisance parameters: integrated likelihood



In many applied problems, one is only interested in some components of the parameter, the remaining part of the parameter being then called the nuisance parameter. This distinction separates the parameter of interest, say $\psi$, from the rest within $\theta = (\psi, \lambda)$, where $\psi$ is the parameter of interest and $\lambda$ is the nuisance parameter. Dealing with nuisance parameters is quite problematic in a frequentist framework, whether one is interested in parameter estimation, in confidence regions determination or in testing. Likelihood approaches need to define proper likelihoods for $\psi$, which in complete generality is not possible. Hence, they use approximations and modifications of proper likelihoods such as partial likelihoods or modified profile likelihoods, see Severini (2000) for a more complete discussion on these issues.

Figure 2: Representation of a Gibbs sample of $10^3$ values of $(\theta, \sigma^2)$ for the normal model, $x_1, \ldots, x_n \sim \mathcal{N}(\theta, \sigma^2)$ with $\bar{x} = 0$, $s^2 = 1$ and $n = 10$, under Jeffreys' prior, along with the pointwise approximation to the 10% HPD region (in darker hues) (Source: Robert and Wraith 2009).

In contrast, the Bayesian framework offers a most
natural way of dealing with nuisance parameters and of
defining proper profile likelihoods: integrating out the
nuisance parameter. In other words, the Bayesian marginal
likelihood for $\psi$ under the prior $\pi(\lambda \mid \psi)$
is given by

$$ f_\pi(x \mid \psi) = \int f(x \mid \psi, \lambda)\, \mathrm{d}\pi(\lambda \mid \psi). $$



This approach offers many advantages: (1) If the
conditional prior $\pi(\lambda \mid \psi)$ is proper, then
$f_\pi(x \mid \psi)$ as defined above is a proper likelihood,
in the sense that it is the density of $x$ under some model
parameterised by $\psi$ alone; (2) Integrating $\lambda$ out
implicitly takes into account the uncertainty on $\lambda$,
contrary to the profile likelihood, or to any other kind of
plug-in likelihood defined by $f(x \mid \psi, \hat\lambda_\psi)$,
where $\hat\lambda_\psi$ is some estimate of $\lambda$ given
$\psi$. In particular, uncertainty measures derived from
$f_\pi(x \mid \psi)$ are not biased downwards by the
replacement of $\lambda$ with $\hat\lambda_\psi$. Hence there
is no need to correct further for this uncertainty, a
correction that is usually necessary when dealing with
plug-in likelihoods and that leads to penalised likelihoods.
This is of particular interest in model selection, when the
parameter of interest is the model itself, as discussed in
Section 5.
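As a purely illustrative sketch of the integrated likelihood $f_\pi(x \mid \psi)$, consider a toy setting that is our own assumption and is not taken from the references: observations $x_1, \ldots, x_n \sim \mathcal{N}(\psi, \lambda)$, with the variance $\lambda$ as nuisance parameter under a proper inverse-gamma prior IG(a, b). In this case the integral is available in closed form, $(2\pi)^{-n/2}\, b^a\, \Gamma(a + n/2)\, \Gamma(a)^{-1}\, (b + S/2)^{-(a + n/2)}$ with $S = \sum_i (x_i - \psi)^2$, and the Python code below checks it against a crude midpoint-rule integration over $\lambda$. The data, the hyperparameters and all function names are hypothetical.

```python
import math

def log_likelihood(xs, psi, lam):
    """Log of f(x | psi, lam) for i.i.d. N(psi, lam) data (lam is the variance)."""
    n = len(xs)
    s = sum((x - psi) ** 2 for x in xs)
    return -0.5 * n * math.log(2.0 * math.pi * lam) - s / (2.0 * lam)

def log_inv_gamma_pdf(lam, a, b):
    """Log density of the proper inverse-gamma IG(a, b) prior on the variance."""
    return a * math.log(b) - math.lgamma(a) - (a + 1.0) * math.log(lam) - b / lam

def integrated_likelihood_numeric(xs, psi, a, b, upper=50.0, steps=50_000):
    """Midpoint-rule approximation of the integral of f(x | psi, lam) pi(lam)
    over lam in (0, upper]; the truncation at `upper` is an assumption."""
    h = upper / steps
    total = 0.0
    for k in range(steps):
        lam = (k + 0.5) * h
        total += math.exp(log_likelihood(xs, psi, lam) + log_inv_gamma_pdf(lam, a, b))
    return h * total

def integrated_likelihood_closed_form(xs, psi, a, b):
    """Closed-form integrated likelihood for the normal / inverse-gamma toy model."""
    n = len(xs)
    s = sum((x - psi) ** 2 for x in xs)
    log_value = (-0.5 * n * math.log(2.0 * math.pi) + a * math.log(b)
                 + math.lgamma(a + 0.5 * n) - math.lgamma(a)
                 - (a + 0.5 * n) * math.log(b + 0.5 * s))
    return math.exp(log_value)

# Hypothetical data and hyperparameters, for illustration only.
data = [0.3, -1.2, 0.8, 1.5, -0.4]
print(integrated_likelihood_numeric(data, psi=0.2, a=2.0, b=1.0))
print(integrated_likelihood_closed_form(data, psi=0.2, a=2.0, b=1.0))
```

Because the prior is proper, the resulting function of $\psi$ is a genuine sampling density for $x$, which is point (1) above.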

However, if $\pi(\lambda \mid \psi)$ is an improper prior,
then $f_\pi(x \mid \psi)$ is not necessarily a likelihood; in
particular, $\int_{\mathcal{X}} f_\pi(x \mid \psi)\, \mathrm{d}x = +\infty$
may occur. A well-known example of such misbehaviour is the
case of the so-called marginalisation paradoxes, see for
instance Robert (2001, Chapter 3). As another example of a
badly behaved marginal likelihood, consider the case
presented in Robert (2001, Chapter 3) and Liseo (2006), where
the observations $x_i \sim \mathcal{N}(\mu_i, 1)$,
$i = 1, \ldots, p$, are independent, where the parameter of
interest is $\psi = \|\mu\|^2 / p = \sum_{i=1}^p \mu_i^2 / p$
with $\mu = (\mu_1, \ldots, \mu_p)$, and where the nuisance
parameter $\lambda$ is the direction of the vector $\mu$. A
natural choice is the uniform distribution on the
$p$-dimensional sphere for $\lambda$, combined with the scale
prior $\pi(\psi) = 1/\sqrt{\psi}$, which leads to a
well-behaved marginal likelihood; see Berger et al. (1998b)
for precise calculations. However, if one considers instead
the Jeffreys prior on $\mu$, i.e. $\pi(\mu) = 1$, then the
posterior distribution of $p\psi = \|\mu\|^2$ is a chi-square
distribution with $p$ degrees of freedom and non-centrality
parameter $\|x\|^2$, which is not a well-behaved posterior.
In particular, the posterior mean of $\psi$ is equal to
$\hat\psi = \|x\|^2 / p + 1$ and satisfies
$\hat\psi - \psi \to 2$ as $p$ goes to infinity.
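The inconsistency of the flat-prior posterior mean can be checked by simulation. The sketch below is ours and only illustrative: it assumes, for convenience, that all the $\mu_i$ are equal to $\sqrt{\psi}$ (so that $\|\mu\|^2 / p = \psi$), draws $x_i \sim \mathcal{N}(\mu_i, 1)$, and returns $\hat\psi - \psi$ with $\hat\psi = \|x\|^2 / p + 1$. The function name, the seed and the values of $p$ are hypothetical choices.

```python
import random

def posterior_mean_gap(p, psi=1.0, seed=0):
    """Return psi_hat - psi, where psi_hat = ||x||^2 / p + 1 is the posterior
    mean of psi under the flat prior pi(mu) = 1 and x_i ~ N(mu_i, 1)."""
    rng = random.Random(seed)
    mu_i = psi ** 0.5  # assumption: identical components, so ||mu||^2 / p = psi
    squared_norm_x = sum((mu_i + rng.gauss(0.0, 1.0)) ** 2 for _ in range(p))
    psi_hat = squared_norm_x / p + 1.0
    return psi_hat - psi

# The gap does not vanish as the dimension grows; it settles near 2.
for dimension in (10, 1_000, 100_000):
    print(dimension, posterior_mean_gap(dimension))
```

The reason is that $\|x\|^2 / p$ already overshoots $\psi$ by about 1, since $E[x_i^2] = \mu_i^2 + 1$, and the flat prior adds a further 1 instead of correcting for it.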

The above examples do not imply that one should 
not use improper priors on nuisance parameters, since 
in most cases little information is known on those 
parameters. Rather they show that one needs to 
be quite careful in selecting improper priors in such 
cases. The construction of Bernardo's (1979) reference
priors is particularly relevant in such frameworks.

In the following section, we describe Bayesian testing
and Bayesian model comparison or model selection. Note
that model selection can be viewed as a specific instance
of the nuisance parameter framework, where the parameter
of interest is the model and the nuisance parameters are
the parameters within each model.



5 Testing versus model comparison

5.1 Bayes factors 

The most standard Bayesian answer to a testing
problem for hypotheses written as $H_0 : \theta \in \Theta_0$
for the null and as $H_1 : \theta \in \Theta_1$ for the
alternative is the Bayesian estimate corresponding to the
0-1 loss function, i.e. the procedure accepting $H_0$ if and
only if

$$ P^\pi[\Theta_0 \mid x] > P^\pi[\Theta_1 \mid x]. $$



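In less formal terms, the null hypothesis is accepted
if it is more probable than the alternative under the
posterior distribution, which is a very intuitive answer. To
constrain the impact of the prior probabilities, a different
quantity is usually adopted, namely the Bayes factor (Kass
and Raftery 1995), which is defined by Jeffreys (1939) and
Jaynes (2003) as

$$ B_{01} = \frac{\pi(\Theta_0 \mid x)}{\pi(\Theta_1 \mid x)} \bigg/ \frac{\pi(\Theta_0)}{\pi(\Theta_1)} = \frac{\int_{\Theta_0} f(x \mid \theta)\, \pi_0(\theta)\, \mathrm{d}\theta}{\int_{\Theta_1} f(x \mid \theta)\, \pi_1(\theta)\, \mathrm{d}\theta}, $$

where $\pi_0$ and $\pi_1$ denote the prior densities under
$H_0$ and $H_1$, respectively.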
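To make the definition of $B_{01}$ concrete, here is a minimal sketch under assumptions of our own: a binomial observation with a uniform prior on the success probability $\theta$, and the one-sided hypotheses $H_0 : \theta \le 1/2$ against $H_1 : \theta > 1/2$. The priors $\pi_0$ and $\pi_1$ are then the uniform densities on each half of the unit interval, and the two integrals are approximated by a midpoint rule. The model, the data and the function names are hypothetical and serve only to illustrate the formula.

```python
def midpoint_integral(func, lo, hi, steps=20_000):
    """Midpoint-rule approximation of the integral of func over [lo, hi]."""
    h = (hi - lo) / steps
    return h * sum(func(lo + (k + 0.5) * h) for k in range(steps))

def bayes_factor_01(successes, trials, split=0.5):
    """B_01 for H0: theta <= split against H1: theta > split, with a uniform
    prior renormalised on each hypothesis (the binomial coefficient cancels)."""
    def likelihood(theta):
        return theta ** successes * (1.0 - theta) ** (trials - successes)
    marginal_0 = midpoint_integral(likelihood, 0.0, split) / split
    marginal_1 = midpoint_integral(likelihood, split, 1.0) / (1.0 - split)
    return marginal_0 / marginal_1

def posterior_probability_of_null(b01, prior_null=0.5):
    """Recover P(H0 | x) from the Bayes factor and the prior probability of H0."""
    prior_odds = prior_null / (1.0 - prior_null)
    posterior_odds = b01 * prior_odds
    return posterior_odds / (1.0 + posterior_odds)

# Hypothetical data: 2 successes out of 10 trials.
b01 = bayes_factor_01(successes=2, trials=10)
print(b01, 1.0 / b01, posterior_probability_of_null(b01))
```

The second function makes explicit that the Bayes factor only needs to be combined with prior odds to return a posterior probability.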



Note that the posterior odds can be recovered from
the Bayes factor by assigning the appropriate prior
probabilities to each of the two models, which contradicts
the criticism of Templeton (2008) that the Bayes factor is
not scaled in probability terms. Interestingly,
$B_{10} = 1/B_{01}$, hence there is no asymmetry in the
definition and construction of the Bayes factor, contrary to
the Neyman-Pearson approach. We do not believe that this is
a drawback and would rather question the interest in forcing
such an asymmetry in the Neyman-Pearson tests.

The Bayes factor, a monotonic transform of the
posterior probability of $H_0$ which eliminates the
influence of the prior weight $\pi(\Theta_0)$, has a similar
interpretation to the classical likelihood ratio. As noted
in the previous section, by integrating out the parameters
within each hypothesis, the uncertainty on each parameter is
taken into account, which induces a natural penalisation for
richer models, as intuited by Jeffreys (1939): "variation is
random until the contrary is shown; and new parameters in
laws, when they are suggested, must be tested one at a time,
unless there is specific reason to the contrary". Although
we strongly dislike using the term because of its undeserved
weight of academic authority, the Bayes factor acts as a
natural Ockham's razor. The well-known connection with the
BIC (Bayesian information criterion, see Robert 2001,
Chapter 5), with a penalty term of the form $d \log n / 2$,
makes explicit the penalisation induced by Bayes factors in
regular parametric models. However, this penalisation goes
beyond this class of models: in much greater generality, the
Bayes factor corresponds asymptotically to a likelihood
ratio with a penalty of the form $d^* \log n^* / 2$, where
$d^*$ and $n^*$ can be viewed as the effective dimension of
the model and the effective number of observations,
respectively; see Berger et al. (2003) and Chambaz and
Rousseau (2008). The Bayes factor therefore offers the major
interest that it does not require computing a complexity
measure (or penalty term), in other words defining what
$d^*$ and $n^*$ are, which is often quite complicated and may
depend on the true distribution.
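The $d \log n / 2$ penalty can be seen at work in a toy example that we assume only for illustration: $n$ observations from $\mathcal{N}(\theta, 1)$ with mean $\bar{x}$, the point null $H_0 : \theta = 0$, and the alternative $\theta \sim \mathcal{N}(0, \tau^2)$ with $\tau^2 = 1$ by default. Here $d = 1$, the exact log Bayes factor is $\frac{1}{2}\log(1 + n\tau^2) - \frac{n\bar{x}^2}{2}\,\frac{n\tau^2}{1 + n\tau^2}$, and the BIC-type approximation is the maximised log-likelihood ratio plus $\frac{1}{2}\log n$. The function names and the numerical values are hypothetical.

```python
import math

def log_bayes_factor_01(xbar, n, tau2=1.0):
    """Exact log B_01 for H0: theta = 0 against H1: theta ~ N(0, tau2), when
    xbar is the mean of n observations from N(theta, 1)."""
    shrink = n * tau2 / (1.0 + n * tau2)
    return 0.5 * math.log(1.0 + n * tau2) - 0.5 * n * xbar ** 2 * shrink

def log_bic_approximation_01(xbar, n, d=1):
    """Maximised log-likelihood ratio of H0 to H1 plus the penalty d*log(n)/2."""
    log_likelihood_ratio = -0.5 * n * xbar ** 2
    return log_likelihood_ratio + 0.5 * d * math.log(n)

# Both quantities grow like log(n)/2 under the null; their difference stays bounded.
for n in (10, 1_000, 100_000):
    print(n, log_bayes_factor_01(0.01, n), log_bic_approximation_01(0.01, n))
```

In this sketch the penalty is not added by hand to the exact Bayes factor: it appears through the term $\frac{1}{2}\log(1 + n\tau^2)$ produced by integrating $\theta$ out.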

5.2 Difficulties 

The inferential problems of Bayesian model selection
and of Bayesian testing are clearly those for which the
most vivid criticisms can be found in the literature:
witness Senn (2008), who states that "the Jeffreys-subjective
synthesis betrays a much more dangerous confusion than the
Neyman-Pearson-Fisher synthesis as regards hypothesis
tests". We find this suspicion rather intriguing given that
the Bayesian approach is the only one giving a proper
meaning to the probability of a null hypothesis,
$\mathbb{P}(H_0 \mid x)$, since alternative methodologies can
at best specify a probability value on the sampling space,
i.e. on the wrong dual space.

If we consider the special case of point null
hypotheses, which is not such a limited scope since it
includes all variable selection setups, there is a
difficulty with using a standard prior modelling in this
environment. As put by Jeffreys (1939), when considering
whether a location parameter $\alpha$ is $0$, "[when] the
prior is uniform, we should have to take $f(\alpha) = 0$ and
$B_{10}$ would always be infinite". This is therefore a case
when the inferential question implies a modification of the
prior, justified by the information contained in the
question. While avoiding the whole issue is a solution, as
with Gelman (2008) "having no patience for statistical
methods that assign positive probability to point hypotheses
of the $\theta = 0$ type that can never actually be true",
considering the null and the alternative hypotheses as two
different models allows for a Bayes factor representation
and corresponds to assigning a positive probability to the
null hypothesis.
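The difficulty with a uniform prior under a point null can be illustrated numerically. The following sketch rests on assumptions of our own: $n$ observations from $\mathcal{N}(\theta, 1)$ with mean $\bar{x}$, the null $H_0 : \theta = 0$ with prior probability 1/2, and a $\mathcal{N}(0, \tau^2)$ prior on $\theta$ under the alternative, made increasingly diffuse by letting $\tau^2$ grow. Names and numerical values are hypothetical.

```python
import math

def bayes_factor_01_point_null(xbar, n, tau2):
    """B_01 for H0: theta = 0 against H1: theta ~ N(0, tau2), with xbar the
    mean of n observations from N(theta, 1)."""
    shrink = n * tau2 / (1.0 + n * tau2)
    return math.sqrt(1.0 + n * tau2) * math.exp(-0.5 * n * xbar ** 2 * shrink)

def posterior_probability_of_null(b01, prior_null=0.5):
    """P(H0 | x) obtained from the Bayes factor and the prior weight of H0."""
    posterior_odds = b01 * prior_null / (1.0 - prior_null)
    return posterior_odds / (1.0 + posterior_odds)

# Data three standard errors away from the null (sqrt(n) * xbar = 3): the more
# diffuse the prior under H1, the more the null is favoured.
for tau2 in (1.0, 1e2, 1e4, 1e8):
    b01 = bayes_factor_01_point_null(xbar=0.3, n=100, tau2=tau2)
    print(tau2, b01, posterior_probability_of_null(b01))
```

In the limit of a flat prior the Bayes factor in favour of the null diverges whatever the data, which is why the prior under the alternative has to be chosen with the testing question in mind.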

In our view, one of the major drawbacks of Bayes
factors, or even posterior odds, is that they cannot be used
under improper priors, for lack of proper normalising
constants. This is an even more acute difficulty than what
is described in Section 4, because the Bayes factor is
simply not defined under improper priors, for any sample
size. Solutions have been proposed, akin to cross-validation
techniques in the classical domain (Berger and Pericchi
1996; Berger et al. 1998a), but they are somewhat too ad hoc
to convince the entire community (and obviously beyond). In
some situations, when parameters shared by both models have
the same meaning in each of the models, an improper prior
can be used on these parameters, in both models.

For instance, when considering variable selection
in a regression model,

$$ y \mid X, \beta, \sigma \sim \mathcal{N}(X\beta, \sigma^2 I_n), $$

e.g. when deciding whether or not the null hypothesis
$H_0 : \beta_1 = 0$ holds, the relevant non-informative
prior distribution is Zellner's (1986) $g$-prior, where
$\pi(\beta \mid \sigma)$ corresponds to a normal
$\mathcal{N}(0, n\sigma^2 (X^T X)^{-1})$ distribution on
$\beta$, together with a "marginal" improper prior on
$\sigma^2$, $\pi(\sigma^2) = \sigma^{-2}$. This means that,
when considering the submodel corresponding to the null
hypothesis $H_0 : \beta_1 = 0$, with parameters $\beta_{-1}$
and $\sigma$, we can also use the "same" $g$-prior
distribution

$$ \beta_{-1} \mid \sigma, X \sim \mathcal{N}\big(0, n\sigma^2 (X_{-1}^T X_{-1})^{-1}\big), $$

where $X_{-1}$ denotes the regression matrix missing the
column corresponding to the first regressor, and
$\sigma^2 \sim \pi(\sigma^2) = \sigma^{-2}$. Since $\sigma$ is
a nuisance parameter in this case, we may use the improper
prior on $\sigma^2$ as common to all submodels and thus avoid
the indeterminacy in the normalising factor of the prior
when computing the Bayes factor

$$ B_{01} = \frac{\int f(y \mid \beta_{-1}, \sigma, X)\, \pi(\beta_{-1} \mid \sigma, X_{-1})\, \pi(\sigma^2)\, \mathrm{d}\beta_{-1}\, \mathrm{d}\sigma^2}{\int f(y \mid \beta, \sigma, X)\, \pi(\beta \mid \sigma, X)\, \pi(\sigma^2)\, \mathrm{d}\beta\, \mathrm{d}\sigma^2}. $$



Figure 3 reproduces an output from Marin and
Robert (2007) that illustrates how this default prior
and the corresponding Bayes factors can be used in
the same spirit as significance levels in a standard
regression model, each Bayes factor being associated
with the test of the nullity of the corresponding
regression coefficient. For instance, only the intercept
and the coefficients of X1, X2, X4, X5 are significant.
                Estimate     BF            log10(BF)
  (Intercept)     9.2714     (illegible)    1.4205 (***)
  X1             -0.0037     7.0839         0.8502 (**)
  X2             -0.0454     3.6850         0.5664 (**)
  X3              0.0573     0.4356        -0.3609
  X4             -1.0905     2.8314         0.4520 (*)
  X5              0.1953     2.5157         0.4007 (*)
  X6             -0.3008     0.3621        -0.4412
  X7             -0.2002     0.3627        -0.4404
  X8              0.1526     0.4589        -0.3383
  X9             -1.0835     0.9069        -0.0424
  X10            -0.3651     0.4132        -0.3838

  evidence against H0: (****) decisive, (***) strong, (**)
  substantial, (*) poor

Figure 3: R output of a Bayesian regression analysis
on a processionary caterpillar dataset with ten covariates
analysed in Marin and Robert (2007). The Bayes factor on
each row corresponds to the test of the nullity of the
corresponding regression coefficient.



This output mimics the standard lm R function outcome
in order to show that the level of information provided by
the Bayesian analysis goes beyond the classical output, not
to show that we can get similar answers to those of a least
squares analysis since, otherwise, "if the Bayes estimator
has good frequency behaviour then we might as well use the
frequentist method" (Wasserman 2008). (While computing issues
are addressed in the following Chapter, we stress that
all items in the table of Figure 3 are obtained via
closed-form formulae.)

The major criticism addressed to the Bayesian approach
to testing is therefore that it is not interpretable on the
same scale as the Neyman-Pearson-Fisher solution, namely in
terms of probability of Type I error and of power of the
tests. In other words, "frequentist methods have coverage
guarantees; Bayesian methods don't; 95 percent frequentist
intervals will live up to their advertised coverage claims"
(Wasserman 2008). A natural reaction is then to question the
appeal of such frequentist properties when considering a
single dataset, i.e., in Jeffreys' (1939) famous words, "a
hypothesis that may be true may be rejected because it had
not predicted observable results that have not occurred",
especially when considering that p-values may be
inadmissible estimators (Hwang et al. 1992). From a
decisional perspective, with which the frequentist
properties should relate, a classical Neyman-Pearson-Fisher
procedure is never evaluated in terms of the consequences of
rejecting the null hypothesis, even though the rejection
must imply a subsequent action towards the choice of an
alternative model. Therefore, complaining that "having a
high relative probability does not mean that a hypothesis is
true or supported by the data" (Templeton 2008), simply
because the Bayesian approach is "relative in that it posits
two or more alternative hypotheses and tests their relative
fits to some observed statistics" (Templeton 2008), is
missing the main purpose of tests, which is not to validate
or invalidate a golden model per se but rather to infer a
working model that allows for acceptable predictive
properties.¹

5.3 Model choice 

For model choice, i.e. when several models are under
comparison for the same observation

$$ \mathfrak{M}_i : x \sim f_i(x \mid \theta_i), \qquad i \in \mathfrak{I}, $$

where $\mathfrak{I}$ can be finite or infinite, the usual
Bayesian answer is similar to the Bayesian tests as
described above. The most coherent perspective (from our
viewpoint) is actually to envision the tests of hypotheses
as particular cases of model choices, rather than trying to
justify the modification of the prior distribution
criticised by Gelman (2008). This also allows us to
incorporate within model choice the alternative solution of
model averaging, proposed by Madigan and Raftery (1994),
which strives to keep all possible models when drawing
inference.

The idea behind Bayesian model choice is to construct
an overall probability on the collection of models
$\cup_{i \in \mathfrak{I}} \mathfrak{M}_i$ in the following
way: the parameter is $\theta = (i, \theta_i)$, i.e. the
model index $i$ and, given the model index equal to $i$, the
parameter $\theta_i$ in model $\mathfrak{M}_i$; the prior
measure on the parameter $\theta$ is then expressed as

$$ \mathrm{d}\pi(\theta) = \sum_{i \in \mathfrak{I}} p_i\, \mathrm{d}\pi_i(\theta_i), \qquad \sum_{i \in \mathfrak{I}} p_i = 1, $$

where both the $\pi_i$'s and the $p_i$'s are part of the
prior modelling, hence chosen by the experimenter. (The
$\pi_i$'s have the natural interpretation of the traditional
prior under model $\mathfrak{M}_i$, while the $p_i$'s
correspond to the prior assessment of the models under
comparison.) As a consequence, the Bayesian model selection
associated with the 0-1 loss function and the above prior is
the model that maximises the posterior probability

$$ \pi(\mathfrak{M}_i \mid x) = \frac{p_i \int_{\Theta_i} f_i(x \mid \theta_i)\, \pi_i(\theta_i)\, \mathrm{d}\theta_i}{\sum_{j \in \mathfrak{I}} p_j \int_{\Theta_j} f_j(x \mid \theta_j)\, \pi_j(\theta_j)\, \mathrm{d}\theta_j} $$

across all models. Contrary to classical plug-in
likelihoods, the marginal likelihoods involved in the above
ratio do compare on the same scale and do not require the
models to be nested: the criticism that "complicating
dimensionality of test statistics is the fact that the
models are often not nested, and one model may contain
parameters that do not have analogues in the other models
and vice versa" (Templeton 2008) is not founded. As
mentioned in Section 4, integrating out the parameters
$\theta_i$ in each of the models takes into account their
uncertainty, thus the marginal likelihoods
$\int_{\Theta_i} f_i(x \mid \theta_i)\, \pi_i(\theta_i)\, \mathrm{d}\theta_i$
are naturally penalised likelihoods. In many setups, the
Bayesian model selector as defined above is consistent,
i.e. as the number of observations increases, the
probability of choosing the right model goes to 1.

¹ It is worth repeating the earlier assertion that all models
are false and that finding that a hypothesis is "true" is not
within our reach, if at all meaningful!

5.4 Other issues

The computational requirements related to handling
a collection of marginal likelihoods will be addressed
in the following Chapter, in connection with the
review of classical solutions in Robert and Marin (2010).
Interestingly enough, the most accurate approximation
technique for marginal likelihoods is, when applicable,
directly derived from Bayes theorem, via Chib's (1995)
rendering:

$$ m(x) = \frac{\pi(\theta)\, f(x \mid \theta)}{\pi(\theta \mid x)} \approx \frac{\pi(\theta)\, f(x \mid \theta)}{\hat\pi(\theta \mid x)}, $$

where $\hat\pi(\theta \mid x)$ is a simulation-based
approximation to the posterior density based on simulated
latent variables. Marin and Robert (2008) illustrate this
method in the setting of mixtures and Robert and Marin
(2010) in the alternative case of a probit model, both of
which demonstrate the precision of this approximation.²
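The identity $m(x) = \pi(\theta) f(x \mid \theta) / \pi(\theta \mid x)$ holds for every value of $\theta$, which can be checked in a conjugate example where the posterior density is known exactly. The sketch below is an assumption-laden illustration and not Chib's (1995) simulation-based method itself: it uses a binomial likelihood with Beta(a, b) priors, so that the exact Beta posterior replaces the estimate $\hat\pi(\theta \mid x)$, and then turns two marginal likelihoods into posterior model probabilities with prior weights $p_i$. The data, the two priors and the function names are hypothetical.

```python
import math

def log_beta_function(a, b):
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def log_beta_pdf(theta, a, b):
    return ((a - 1.0) * math.log(theta) + (b - 1.0) * math.log(1.0 - theta)
            - log_beta_function(a, b))

def log_binomial_pmf(x, n, theta):
    log_choose = math.lgamma(n + 1) - math.lgamma(x + 1) - math.lgamma(n - x + 1)
    return log_choose + x * math.log(theta) + (n - x) * math.log(1.0 - theta)

def log_marginal_via_bayes_identity(x, n, a, b, theta):
    """log m(x) = log pi(theta) + log f(x | theta) - log pi(theta | x), with the
    exact Beta(a + x, b + n - x) posterior standing in for its estimate."""
    log_prior = log_beta_pdf(theta, a, b)
    log_likelihood = log_binomial_pmf(x, n, theta)
    log_posterior = log_beta_pdf(theta, a + x, b + n - x)
    return log_prior + log_likelihood - log_posterior

def posterior_model_probabilities(log_marginals, prior_weights):
    """Normalise p_i * m_i(x) over the models, on the log scale for stability."""
    log_terms = [math.log(p) + lm for p, lm in zip(prior_weights, log_marginals)]
    shift = max(log_terms)
    weights = [math.exp(t - shift) for t in log_terms]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical data (7 successes in 20 trials) and two candidate priors.
x_obs, n_obs = 7, 20
log_m_flat = log_marginal_via_bayes_identity(x_obs, n_obs, 1.0, 1.0, theta=0.3)
log_m_peaked = log_marginal_via_bayes_identity(x_obs, n_obs, 20.0, 20.0, theta=0.3)
print(math.exp(log_m_flat), math.exp(log_m_peaked))
print(posterior_model_probabilities([log_m_flat, log_m_peaked], [0.5, 0.5]))
```

Evaluating the identity at another value of $\theta$ returns the same marginal likelihood; in practice, the accuracy of the method depends on how well $\hat\pi(\theta \mid x)$ approximates the posterior density at the chosen point.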

Posterior odds and Bayes factors are the most common
Bayesian approaches to testing; however, they are not the
only ones. In particular, the choice of the 0-1 loss
function is not necessarily relevant or the most relevant.
In some situations it might be more interesting to penalise
the loss with the distance to the null hypothesis, for
instance; see Robert and Rousseau (2002) and Rousseau
(2007), where such ideas are applied to goodness-of-fit
tests, or Bernardo (2009).



6 On pervasive computing 

Bayesian analysis has long been derided for providing
optimal answers that could not be computed. With
the advent of early Monte Carlo methods, of personal
computers, and, more recently, of more powerful Monte
Carlo methods (Hitchcock 2003), the pendulum appears to
have swung to the other extreme and "Bayesian methods seem
to quickly move to elaborate computation" (Gelman 2008), a
feature that does not make them less suspicious: "a
simulation method of inference hides unrealistic
assumptions" (Templeton 2008). The simulation techniques
that have done so much to promote Bayesian analysis
in the past decades are detailed in the next Chapter
and thus not described here. We nonetheless want to point
out that, while simulation methods can be misused, as can
about any other methodology, and while "Bayesian simulation
seems stuck in an infinite regress of inferential
uncertainty" (Gelman 2008), there exist enough convergence
assessment techniques (Robert and Casella 2009) to ensure a
reasonable confidence about the approximation provided by
those simulation methods. Thus, as rightly stressed by
Bernardo (2008), "the discussion of computational issues
should not be allowed to obscure the need for further
analysis of inferential questions".

² There have been discussions about the accuracy of this
method in multimodal settings (Frühwirth-Schnatter 2004),
but straightforward modifications (Berkhof et al. 2003; Lee
et al. 2008) overcome such difficulties and make for both an
easy and a well-grounded computational tool associated with
Bayes factors.

The field of Bayesian computing is therefore very 
much alive and, while its diversity can be construed 
as a drawback by some, we do see the emergence of 
new computing methods adapted to specific applica- 
tions as most promising, because it bears witness to 
the growing involvement of new communities of re- 
searchers in Bayesian advances. 